Dataset

You (the Wholesale Customers dataset from the UCI Machine Learning Repository1) are a wholesale distributor in 3 regions: “Oporto” with 47 customers, “Lisbon” with 77 customers and “Other” with 316 customers. Among these 440 customers, 298 belong to the “Cafe” distribution channel (including hotels and restaurants) and 142 to “Retail”.

library(cluster)
library(dplyr)
library(GGally)
library(ggplot2)
library(knitr)
library(pander)
library(MASS)
library(reshape2)
library(ROCR)
sale = read.csv("WholesaleCustomer.csv", header=TRUE)
colnames(sale)[7] = "Detergents" # rename "Detergents_Paper" for brevity
sale = within(sale, {Channel[Channel == 1] <- "Cafe" 
                     Channel[Channel == 2] <- "Retail" 
                     Region[Region == 1] <- "Lisbon" 
                     Region[Region == 2] <- "Oporto" 
                     Region[Region == 3] <- "Other" } )
sale[,c("Channel", "Region")] = lapply(sale[,c("Channel", "Region")], as.factor)

Here are the first few rows of your annual revenue dataset:

kable(head(sale), align="c")
Channel Region Fresh Milk Grocery Frozen Detergents Delicassen
Retail Other 12669 9656 7561 214 2674 1338
Retail Other 7057 9810 9568 1762 3293 1776
Retail Other 6353 8808 7684 2405 3516 7844
Cafe Other 13265 1196 4221 6404 507 1788
Retail Other 22615 5410 7198 3915 1777 5185
Retail Other 9413 8259 5126 666 1795 1451

Here is how many customers fall into each revenue range, for every product and for each group (“Cafe” or “Retail”)
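Such per-product, per-channel histograms can be sketched with ggplot2 (a sketch assuming the sale data frame loaded above; the log-scaled x-axis drops zero revenues with a warning):

```r
# Faceted histograms: customer counts per revenue range,
# for each product and each channel
library(ggplot2)
library(reshape2)
msale = melt(sale, id.vars=c("Channel","Region"),
             variable.name="Product", value.name="Revenue")
ggplot(msale, aes(x=Revenue, fill=Channel)) +
  geom_histogram(bins=30, alpha=0.6, position="identity") +
  scale_x_log10() +                       # log scale reveals the bell shapes
  facet_wrap(~Product, scales="free_y") +
  theme_bw()
```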

Working with log-revenues instead of revenues

The first important thing to notice is that the histograms look reasonably bell-shaped at the log-scale of revenues. Thus we transform our data from revenues to log-revenues.

Do log-numbers look unusual to you? Then you may ask: are there any other advantages to using them? Yes: log-revenues automatically capture the “percentage” difference between revenues (120 is 20% larger than 100, and 1200 is also 20% larger than 1000) rather than the absolute difference. You can also think of a log-value as an “order of magnitude” or “count of digits in a number”. A log-revenue of 2 means a revenue of 100, a log-revenue of 3 means a revenue of 1000, a log-revenue of 4 means a revenue of 10000, and a log-revenue of 2.5 means a revenue of about 316, exactly halfway between 100 and 1000 on the multiplicative scale.
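A quick arithmetic check of the “percentage” property:

```r
# Equal percentage steps become equal additive steps on the log scale
log10(120) - log10(100)     # a 20% increase
log10(1200) - log10(1000)   # again a 20% increase: the same additive step
10^2.5                      # log-revenue 2.5 is a revenue of ~316,
                            # the geometric midpoint of 100 and 1000
```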

products = colnames(sale)[3:8]  # product names
sale$Total = rowSums(sale[,products]) # total revenue
l1sale = sale;  l1sale[,3:9] = log10(1. + sale[,3:9]) # data frame with log-revenues

Analysis of variance

From the histograms above it looks as if the mean revenues of Milk, Grocery and Detergents (but not necessarily of Fresh, Frozen and Delicassen) are significantly different for Cafe and Retail customers. There are parametric and non-parametric ways to test this hypothesis, the key word being ANOVA (analysis of variance). In such an analysis the groups are considered as given and the revenues as the dependent variables.

Strictly speaking, ANOVA makes assumptions of normality (which would better be called “gaussianity”), homoscedasticity, and independence of observations. For the data set under consideration there is no doubt about independence; the other two properties are not necessarily satisfied. Practically speaking, there is a growing discussion that many statistical methods are robust to deviations from gaussianity, and that a reasonably bell-shaped histogram (as in our case) suffices for applying those methods. There are even opinions that normality tests should never be applied at all.

Anyway, one can first resort to the non-parametric Kruskal-Wallis test. Small p-values (smaller than a chosen threshold, typically 0.05) show that we can reject the hypothesis that the revenue distributions are the same for Retail and Cafe customers. This holds for each product separately as well as for the total revenue.

apply(l1sale[,3:9], 2, function(x) kruskal.test(x ~ Channel, data=l1sale)$p.value)
##      Fresh       Milk    Grocery     Frozen Detergents Delicassen      Total 
##   1.83e-04   7.67e-38   6.60e-50   7.39e-07   4.56e-55   5.49e-04   1.72e-24

Welch’s two-sample t-test assumes gaussian distributions, but not homoscedasticity. It leads to similar results

apply(l1sale[,3:9], 2, function(x) t.test(x ~ Channel, data=l1sale, var.equal=FALSE)$p.value)
##      Fresh       Milk    Grocery     Frozen Detergents Delicassen      Total 
##   5.76e-04   3.26e-46   1.84e-68   1.39e-07   2.26e-88   5.14e-03   4.70e-28

If we consider classical ANOVA applicable, the results are again similar

apply(l1sale[,3:9], 2, function(x) anova(lm(x ~ Channel, data=l1sale))[["Pr(>F)"]][1])
##      Fresh       Milk    Grocery     Frozen Detergents Delicassen      Total 
##   3.56e-04   1.84e-40   5.34e-57   1.94e-07   1.51e-73   5.84e-03   1.49e-22

Notice, however, that the p-values for Fresh, Frozen and Delicassen are more than 25 orders of magnitude higher than those for Milk, Grocery and Detergents. This matches our intuition from the histograms.

The differences between the groups can be illustrated with boxplots (keep in mind, though, that the central line in a boxplot shows the median, not the mean)
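A sketch of such boxplots with ggplot2 (assuming the l1sale data frame and the products vector defined above):

```r
# Boxplots of log-revenues per product, split by channel;
# the central line of each box is the median
library(ggplot2)
library(reshape2)
ml1 = melt(l1sale[, c("Channel", products)], id.vars="Channel",
           variable.name="Product", value.name="LogRevenue")
ggplot(ml1, aes(x=Product, y=LogRevenue, fill=Channel)) +
  geom_boxplot() +
  theme_bw(base_size=14)
```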

For the different regions, the p-values are large. Thus we cannot reject the hypothesis that the mean revenues are the same across regions

apply(l1sale[,3:9], 2, function(x) kruskal.test(x ~ Region, data=l1sale)$p.value)
##      Fresh       Milk    Grocery     Frozen Detergents Delicassen      Total 
##      0.586      0.482      0.202      0.239      0.797      0.890      0.778

Analogously to Welch’s two-sample t-test, the three groups can be tested for equal means without assuming equal variances

apply(l1sale[,3:9], 2, function(x) oneway.test(x ~ Region, data=l1sale, var.equal=FALSE)$p.value)
##      Fresh       Milk    Grocery     Frozen Detergents Delicassen      Total 
##      0.713      0.609      0.117      0.151      0.708      0.947      0.903

The results from classical ANOVA are also similar

apply(l1sale[,3:9], 2, function(x) anova(lm(x ~ Region, data=l1sale))[["Pr(>F)"]][1])
##      Fresh       Milk    Grocery     Frozen Detergents Delicassen      Total 
##      0.658      0.623      0.211      0.193      0.704      0.957      0.917

That the differences between the regions are small can be illustrated with box- and violin-plots.

Here is the revenue comparison for different channels and different regions

Discriminant analysis

From the histograms above it looks as if the group of Retail customers differs from the group of Cafe customers in the revenues of Milk, Grocery and Detergents, while not necessarily in Fresh, Frozen and Delicassen. Is it indeed so? To formulate this question more precisely: which combinations of revenues allow one to maximally separate the groups? In such an analysis the revenues are considered as given and the group membership as the dependent variable.

LDA = Linear discriminant analysis

The most popular method is LDA. Omitting a discussion of its assumptions (multivariate gaussianity and identical covariance matrices for all groups), I proceed with finding the linear combination that best separates the classes

lda_ch = lda(Channel ~ ., data=l1sale[,c(1,3:8)], CV=FALSE) 
lda_ch$scaling
##                LD1
## Fresh      -0.0966
## Milk        0.4631
## Grocery     0.6784
## Frozen     -0.3453
## Detergents  1.3510
## Delicassen  0.0145

The largest contributions to it come from Milk, Grocery and Detergents. Keeping our histograms in mind, this does not come as a surprise. Here is the density plot along this linear discriminant “LD1”

plot(lda_ch)

The mean values of the log-revenues for each group are

lda_ch$means
##        Fresh Milk Grocery Frozen Detergents Delicassen
## Cafe    3.87 3.34    3.45   3.27       2.58       2.85
## Retail  3.64 3.92    4.13   2.97       3.73       3.00

The quality of separation can be estimated by the ratio of the between- and within-group standard deviations

lda_ch$svd
## [1] 23.7

To estimate the quality of the classification produced by the LDA we should not use the training set but resort to cross-validation. Since leave-one-out cross-validation is built in, let us use it. Here are the confusion matrix and the total percentage of correctly classified customers

lda_ch_cv = lda(Channel ~ ., data=l1sale[,c(1,3:8)], CV=TRUE) 
ct_lda <- table(l1sale$Channel, lda_ch_cv$class); ct_lda # confusion matrix
##         
##          Cafe Retail
##   Cafe    273     25
##   Retail   13    129
#diag(prop.table(ct_lda, 1))  # percent correct for each Channel
sum(diag(prop.table(ct_lda))) # total percent correct
## [1] 0.914

This method also provides the posterior probabilities of belonging to each group. Thus, we can assess the quality of the model with respect to more refined measures. Below, after discussing QDA, I will plot the ROC curves and calculate the AUC.

QDA = Quadratic discriminant analysis

QDA differs from LDA in that it does not assume a common covariance matrix for all classes. This means more free parameters and thus greater flexibility. Since QDA uses quadratic functions, there are no easily interpretable linear discriminants. We can still use the method to make predictions for unseen samples (here via leave-one-out cross-validation). This results in the following confusion matrix and total percentage of correctly classified customers. The results appear to be slightly worse than those for LDA

qda_ch_cv = qda(Channel ~ ., data=l1sale[,c(1,3:8)], CV=TRUE, method="mle")
ct_qda <- table(l1sale$Channel, qda_ch_cv$class); ct_qda # confusion matrix
##         
##          Cafe Retail
##   Cafe    269     29
##   Retail   13    129
#diag(prop.table(ct_qda, 1))  # percent correct for each Channel
sum(diag(prop.table(ct_qda))) # total percent correct
## [1] 0.905

Comparing ROC curves

roc_auc = function(da){
    pred = prediction(predictions=da$posterior[,"Retail"], 
                      labels=as.numeric(l1sale$Channel=="Retail"))
    perf <- performance(pred, measure = "tpr", x.measure = "fpr")
    roc.auc = performance(pred, measure="auc"); 
    return(list(perf=perf, auc=roc.auc@y.values[[1]]))
}

lda_roc_auc = roc_auc(lda_ch_cv)
qda_roc_auc = roc_auc(qda_ch_cv)

cutoffs = c(0.2,0.4, 0.5, 0.6, 0.8)
plot(lda_roc_auc$perf, colorize=TRUE, lwd=3, 
     main="ROC curve for LDA and QDA prediction of Retail channel",
     print.cutoffs.at=cutoffs, text.adj=c(1.0,-0.5))
plot(qda_roc_auc$perf, colorize=TRUE, lwd=3, add=TRUE)
text(-0.03, 0.99, "LDA", pos=4)
text(0.1, 0.85, "QDA", pos=4)
text(0.15,0.6,paste0("LDA  AUC = ", round(lda_roc_auc$auc,3)), pos=4)
text(0.2,0.5,paste("cutoff positions at", paste(cutoffs, collapse=", ")),  pos=4)
text(0.15,0.3,paste0("QDA  AUC = ", round(qda_roc_auc$auc,3)), pos=4)

The results of discriminant analysis should typically be compared with other classification methods such as logistic regression, support vector machines, or neural networks.
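As a minimal sketch of such a comparison, here is a logistic regression baseline for the same task (assuming the l1sale data frame above; the accuracy is computed on the training set, not cross-validated, so it is slightly optimistic):

```r
# Logistic regression predicting the Channel from the six log-revenues
glm_ch = glm(Channel ~ ., data=l1sale[, c(1, 3:8)], family=binomial)
glm_prob = predict(glm_ch, type="response")    # P(Channel == "Retail")
glm_pred = ifelse(glm_prob > 0.5, "Retail", "Cafe")
table(l1sale$Channel, glm_pred)                # confusion matrix
mean(glm_pred == l1sale$Channel)               # training-set accuracy
```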

Factor analysis

Revenues for some products are correlated with each other

l1sale.cor = cor(l1sale[,products]); l1sale.cor
##              Fresh    Milk Grocery  Frozen Detergents Delicassen
## Fresh       1.0000 -0.0211  -0.133  0.3863     -0.159      0.256
## Milk       -0.0211  1.0000   0.761 -0.0552      0.679      0.342
## Grocery    -0.1330  0.7611   1.000 -0.1645      0.797      0.240
## Frozen      0.3863 -0.0552  -0.165  1.0000     -0.213      0.256
## Detergents -0.1587  0.6787   0.797 -0.2128      1.000      0.168
## Delicassen  0.2564  0.3423   0.240  0.2563      0.168      1.000

Let us try to determine the (independent) latent variables responsible for the variety of the observed revenues. Exploratory factor analysis is suitable here.

faa2 = factanal(l1sale[,products], factors=2, method="mle", rotation="varimax", scores="regression")
print(faa2, digits = 2, cutoff = .2, sort = TRUE)
## 
## Call:
## factanal(x = l1sale[, products], factors = 2, scores = "regression",     rotation = "varimax", method = "mle")
## 
## Uniquenesses:
##      Fresh       Milk    Grocery     Frozen Detergents Delicassen 
##       0.65       0.30       0.13       0.61       0.26       0.65 
## 
## Loadings:
##            Factor1 Factor2
## Milk        0.83          
## Grocery     0.93          
## Detergents  0.84          
## Fresh               0.58  
## Frozen              0.61  
## Delicassen  0.31    0.51  
## 
##                Factor1 Factor2
## SS loadings       2.38    1.01
## Proportion Var    0.40    0.17
## Cumulative Var    0.40    0.57
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 4.41 on 4 degrees of freedom.
## The p-value is 0.353

The chi-square statistic and p-value in “factanal” result from testing the hypothesis that the 2-factor model is sufficient to reproduce the observed correlations. When the p-value is large (larger than your significance level, typically 0.05), as it is here, we cannot reject this hypothesis. Thus, 2 factors are enough to explain the empirical correlation matrix.

Let us check how well the empirical correlation matrix is approximated by the fitted one by computing the difference between them. As expected, all differences are very small.

faa2.cor.approx <- faa2$loadings %*% t(faa2$loadings) + diag(faa2$uniquenesses)
round(l1sale.cor - faa2.cor.approx, 3)
##             Fresh   Milk Grocery Frozen Detergents Delicassen
## Fresh       0.000 -0.011  -0.001  0.021      0.014     -0.010
## Milk       -0.011  0.000  -0.001 -0.014     -0.003      0.027
## Grocery    -0.001 -0.001   0.000  0.008      0.002     -0.007
## Frozen      0.021 -0.014   0.008  0.000     -0.001     -0.012
## Detergents  0.014 -0.003   0.002 -0.001      0.000     -0.010
## Delicassen -0.010  0.027  -0.007 -0.012     -0.010      0.000

Since factors are determined only up to rotations, we have chosen the “varimax” rotation. It sparsifies the contributions of the individual products and thus makes the interpretation easier.
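The effect of the rotation can be seen by refitting the same model without it; the unrotated loadings mix the products more strongly (a sketch, assuming l1sale and products from above):

```r
# Same 2-factor model without rotation: the loadings are less sparse
faa2.none = factanal(l1sale[, products], factors=2, rotation="none")
print(faa2.none$loadings, cutoff=0.2)
```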

2 factors can explain 57% of data variance.

The exact contribution of each product to each factor is shown here

ggplot(as.data.frame(unclass(faa2$loadings)), 
                           aes(x=Factor1, y=Factor2) ) + geom_text(label=products) + 
            theme_bw(base_size = 16) + ggtitle("Contribution of products to the two latent factors") 

And here are the two factor scores (found by “factanal”) for each customer

Large uniquenesses show, however, that there is much more in this data than can be explained by two factors.

faa2$uniquenesses
##      Fresh       Milk    Grocery     Frozen Detergents Delicassen 
##      0.654      0.296      0.133      0.613      0.263      0.646

For each of Fresh, Frozen and Delicassen, more than 60% of the revenue variability is not explained by the common factors.

Cluster analysis

Cluster analysis answers the question: which customers belong together and can be combined into a group, such that this group is noticeably different from the other groups (formed on the same principle)? Here I choose to work only with the revenue data, in order to be able to check how the clusters found relate to the known distribution channels and regions.

Cluster analysis is very sensitive to the chosen distance between customers. And defining what a “meaningful distance between customers” is, is crucial for inferring business value.

A business user should ask himself which differences between customers really matter for the business.

A mathematically aware business user should also ask himself how scaling, dimensionality reduction and the choice of distance metric affect the result.

Agglomerative hierarchical clustering suggests 2 or 3 clusters

Being as good a business user as I am, I choose to care about the difference to the mean (hence the scaling), reduce the dimensionality to 4 principal components, and use the manhattan distance.

scaled.pca4 = princomp(scale(l1sale[,products]))$scores[,1:4]
di = dist(scaled.pca4, method="manhattan")  # manhattan distance

For the agglomerative hierarchical clustering I choose the “Ward” method. It minimizes the total within-cluster variance and thus leads to the most compact clusters possible.

ag = agnes(di, diss=TRUE, method="ward")
plot(ag, which=2, labels=FALSE, xlab="", sub="", 
     main="Dendrogram: 4-comp PCA, manhattan distance, Ward agglomeration") # dendrogram

The dendrogram suggests that 2 or 3 clusters are meaningful. Let us investigate 3 clusters

l1sale$ag3 <- factor(cutree(ag, k = 3))
table(l1sale$Channel, l1sale$ag3, dnn=c("Channel", "ag3") )
##         ag3
## Channel    1   2   3
##   Cafe    22  50 226
##   Retail  56  81   5

The third cluster consists mainly of Cafe customers, while the first and the second include both Cafe and Retail. What about the revenues of each product?

This set of boxplots shows that group 3 (almost entirely Cafe customers) buys less Milk, Grocery and Detergents than the other two groups. Among the groups with mixed Cafe and Retail customers, group 1 (30% Cafe / 70% Retail, low-revenue customers) buys less Fresh, Frozen and Delicassen than group 2 (40% Cafe / 60% Retail, high-revenue customers). Here is how these groups look when I choose the axes as sums of revenues

Thus, the 3 groups found are

  1. 30/70 mix Cafe/Retail with low-to-middle total revenues due to low Fresh, Frozen and Delicassen

  2. 40/60 mix Cafe/Retail with high total revenues

  3. Cafe customers with low total revenues due to low Milk, Grocery and Detergents

Let us depict the customers along easily interpretable axes: sums of revenues of specific products. The 3 “groups” found do not look compact at all.
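A sketch of such a plot (the column names MGD and FFD are my ad-hoc abbreviations, not from the dataset; ag3 is the cluster assignment computed above):

```r
# Customers on interpretable axes: sums of log-revenues of two product groups
library(ggplot2)
l1sale$MGD = rowSums(l1sale[, c("Milk", "Grocery", "Detergents")])
l1sale$FFD = rowSums(l1sale[, c("Fresh", "Frozen", "Delicassen")])
ggplot(l1sale, aes(x=MGD, y=FFD, color=ag3)) +
  geom_point(size=2) + theme_bw() +
  labs(x="Milk + Grocery + Detergents (log)",
       y="Fresh + Frozen + Delicassen (log)")
```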

Can this be related to an “unfortunate” choice of axes? Let us try the “most fortunate” axes possible (the first two PCA components)

clusplot(l1sale[,products], l1sale$ag3, color=TRUE, shade=FALSE, labels=3, lines=0,
         col.p=group_color[as.integer(l1sale$ag3)], 
         col.clus=group_color[c(1,3,2)],
         main="3 groups found by hierarchical clustering")

No success either. How shall we interpret this?

A good tool to assess the quality of clustering is the silhouette diagram. It shows which objects lie well within their cluster, and which ones are merely somewhere in between clusters.

si.ag3 = silhouette(as.numeric(l1sale$ag3), di)
plot(si.ag3, col=group_color, 
     main="Silhouette plot for 3 groups from agglomerative clustering")

With more than a few negative values, the silhouette diagram does not look good. With average silhouette widths of 0.10, 0.29 and 0.25, the clustering should be assessed as “no substantial structure has been found”.
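The quoted average widths can be read off the silhouette summary (assuming si.ag3 from above):

```r
# Per-cluster and overall average silhouette widths
summary(si.ag3)$clus.avg.widths  # one average width per cluster
summary(si.ag3)$avg.width        # overall average width
```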

Partitioning around medoids: a robust version of K-means

Let us look at the 3 clusters from the point of view of the K-medoids algorithm. It can be considered a more robust version of k-means that searches for k representative objects (“medoids”) among the data points. These “medoid customers” should represent the “typical customer” of each group.

pam3 = pam(di, k=3, diss=TRUE, cluster.only=FALSE)
l1sale$pam3 = as.factor(pam3$cluster)
table(l1sale$Channel, l1sale$pam3, dnn=c("Channel", "pam3")) # compare clusters distribution Channels
##         pam3
## Channel    1   2   3
##   Cafe    20 162 116
##   Retail 115   2  25

Groups 2 and 3 consist mainly of Cafe customers, while group 1 consists mainly of Retail customers. The separation of the distribution channels here is better than the one achieved by the agglomerative hierarchical clustering. What is the difference in revenues?

Thus, the 3 groups found are

  1. mostly Retail customers with high Milk, Grocery, Detergent revenues

  2. mostly Cafe customers with low Milk, Grocery, Detergent revenues

  3. mostly Cafe customers with medium Milk, Grocery, Detergent revenues

The differences in Fresh, Frozen and Delicassen between the groups are relatively small

Here are the log-revenues of the “typical customers” found

l1sale[pam3$medoids, c("pam3", products, "Total")]
##     pam3 Fresh Milk Grocery Frozen Detergents Delicassen Total
## 417    1  3.64 4.04    4.04   2.93       3.83       3.00  4.54
## 116    2  4.05 2.90    3.48   3.43       2.44       2.79  4.27
## 315    3  4.03 3.25    3.88   3.17       2.93       3.09  4.37

With sums of revenues of specific products on the axes, here are the 3 customer “groups”

This clustering is quite different from the one found by the agglomerative method.

Did partitioning around medoids indeed find distinct clusters in this dataset? The silhouette plot (low values of the silhouette width) again indicates no significant structure in the data.

si.pam3 = silhouette(as.numeric(l1sale$pam3), di)
plot(si.pam3, col=group_color, main="Silhouette plot for 3 groups found by PAM")

What to do or is clustering meaningful for this dataset?

Having done the cluster analysis, several questions remain open

  • Do these results actually mean that our customers cannot be structured?

  • Is it meaningful for this data to assume that there are several distinct groups and to enforce a search for them?

If one still wants to search for distinct groups, it may be reasonable to reconsider what the “distance between customers” means. As the histograms and boxplots have shown, the main difference lies in 3 of the 6 products, while the distance considered so far tried to take all 6 into account. The ways out of this situation (and out of the curse of dimensionality) I will discuss in the more extensive “Wholesale Customers: cluster analysis” write-up. Ideally one would not guess the appropriate distance matrix, but let the algorithm determine it.
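One simple sketch of such a reconsideration: base the distance only on the three products that actually separate the channels (this is an illustration, not the solution developed in the extended write-up):

```r
# Distance restricted to the three discriminating products
library(cluster)
disc = c("Milk", "Grocery", "Detergents")
di.disc = dist(scale(l1sale[, disc]), method="manhattan")
pam3.disc = pam(di.disc, k=3, diss=TRUE)
table(l1sale$Channel, pam3.disc$clustering)   # compare with the channels
```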

Here I discuss another question: why are we actually trying so hard to “put a single label” on each customer?

  • I am aiming at a more natural “segmentation” of customers that behave “in the same way” with respect to at least some products (even though there are no distinct groups). Neighboring segments should be similar. Then it is not so important to which segment exactly a given customer belongs, or how many segments exactly there are.

Customer segmentation

Similar customers are in one segment. Neighboring segments are similar to each other in some products.

The upper circles in this visualization are for Milk, Grocery and Detergents. Large revenues of those, as we know from the previous analysis, are typical for Retail customers.
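A segmentation of this kind, with similar neighboring segments, can be produced for example with a self-organizing map; the kohonen package used here is my assumption, as the tool behind the original visualization is not shown:

```r
# Self-organizing map: each circle is a segment; neighboring segments
# have similar codebook vectors (i.e., similar "typical" revenues)
library(kohonen)
set.seed(1)
som_fit = som(scale(as.matrix(l1sale[, products])),
              grid=somgrid(xdim=6, ydim=6, topo="hexagonal"), rlen=500)
plot(som_fit, type="codes")   # codebook vectors of the segments
```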

Some segments are quite interesting. Here we display actual revenues instead of log-revenues

ID Channel Region Fresh Milk Grocery Frozen Detergents Delicassen Total
67 Cafe Other 9 1534 7417 175 3468 27 12630
129 Cafe Other 140 8847 3823 142 1062 3 14017
185 Cafe Other 327 918 4710 74 334 11 6374
204 Cafe Lisbon 583 685 2216 469 954 18 4925
234 Cafe Lisbon 964 4984 3316 937 409 7 10617
413 Cafe Other 97 3605 12400 98 2970 62 19232
440 Cafe Other 2787 1698 2510 65 477 52 7589
ID Channel Region Fresh Milk Grocery Frozen Detergents Delicassen Total
76 Cafe Other 20398 1137 3 4407 3 975 26923
123 Cafe Other 12212 201 245 1991 25 860 15534
162 Cafe Other 12434 540 283 1092 3 2233 16585
357 Cafe Other 22686 134 218 3157 9 548 26752
ID Channel Region Fresh Milk Grocery Frozen Detergents Delicassen Total
17 Retail Other 1020 8816 12121 134 4508 1080 27679
39 Retail Other 4591 15729 16709 33 6956 433 44451
58 Retail Other 5417 9933 10487 38 7572 1282 34729
60 Cafe Other 6137 5360 8040 129 3084 1603 24353
61 Retail Other 8590 3045 7854 96 4095 225 23905
95 Retail Other 5626 12220 11323 206 5038 244 34657
107 Retail Other 1454 6337 10704 133 6830 1831 27289
146 Retail Other 22039 8384 34792 42 12591 4430 82278
156 Retail Other 1989 10690 19460 233 11577 2153 46102
176 Retail Other 2343 7845 11874 52 4196 1697 28007
217 Retail Lisbon 2532 16599 36486 179 13308 674 69778
222 Cafe Lisbon 5396 7503 10646 91 4167 239 28042
246 Retail Lisbon 3062 6154 13916 230 8933 2784 35079
265 Retail Lisbon 1073 9679 15445 61 5980 1265 33503
334 Retail Oporto 8565 4980 67298 131 38102 1215 120291
419 Retail Other 660 8494 18622 133 6740 776 35425
421 Cafe Other 4456 5266 13227 25 6818 1393 31185
ID Channel Region Fresh Milk Grocery Frozen Detergents Delicassen Total
24 Retail Other 26373 36423 22019 5154 4337 16523 110829
88 Cafe Other 43265 5025 8117 6312 1579 14351 78649
182 Cafe Other 112151 29627 18148 16745 4948 8550 190169
184 Cafe Other 36847 43950 20170 36534 239 47943 185683
203 Cafe Lisbon 25203 11487 9490 5065 284 6854 58383
266 Cafe Lisbon 5909 23527 13699 10155 830 3636 57756
326 Cafe Oporto 32717 16784 13626 60869 1272 5609 130877
385 Cafe Other 10683 21858 15400 3635 282 5120 56978
436 Cafe Other 29703 12051 16027 13135 182 2204 73302

Do you want to know more? How similar, actually, are the customers within a given segment? How similar is a segment to its neighbors? I can tell you about this personally. I also found 3 more segments that raise questions and opportunities.

Contact: Olga Lalakulich olalakul@gmail.com

References


  1. https://archive.ics.uci.edu/ml/datasets/Wholesale+customers The data set originates from a larger database referred to in: Abreu, N. (2011). Analise do perfil do cliente Recheio e desenvolvimento de um sistema promocional. Mestrado em Marketing, ISCTE-IUL, Lisbon.